Natural Language Processing - Twitter US Airline Sentiment Analysis. - By David Salako.


Background and Context.

Twitter has around 330 million monthly active users, which allows businesses to reach a broad population and connect with customers without intermediaries. On the other hand, there is so much information that it is difficult for brands to quickly detect negative social mentions that could harm their business.

Sentiment analysis/classification, which involves monitoring emotions in conversations on social media platforms, has become a key strategy in social media marketing.

Listening to how customers feel about the product/service on Twitter allows companies to understand their audience, keep on top of what is being said about their brand and their competitors, and discover new trends in the industry.


Objective.

To implement the techniques learned as part of the course, with the following learning outcomes:

* Basic understanding of text pre-processing
* What to do after text pre-processing
* Bag of words
* TF-IDF
* Building the classification model
* Evaluating the model


Dataset.

A sentiment analysis task about the problems of each major U.S. airline. Twitter data was scraped in February 2015, and contributors were asked first to classify tweets as positive, negative, or neutral, and then to categorize the negative reasons (such as "late flight" or "rude service").


The dataset has the following columns:

* tweet_id                                                           
* airline_sentiment                                               
* airline_sentiment_confidence                               
* negativereason                                                   
* negativereason_confidence
* airline                                    
* airline_sentiment_gold                                              
* name     
* negativereason_gold 
* retweet_count
* text
* tweet_coord
* tweet_created
* tweet_location 
* user_timezone

There are 14,640 rows and 15 columns.

Importing the libraries.

Exploratory Data Analysis (EDA).

Observation:
The dataset has 15 columns and 14640 rows of data.

Observation:
There are lots of nulls present in the attributes:

Observation:
Null counts in the fields, as stated above.

Observations:

Observations:

The additional table and graph further illustrate the earlier observations regarding missing values.

Observations:

A visualization of the number of unique values in each column. tweet_id has the most because each tweet is uniquely identified, which makes it of little value in the model-building exercise.

Observations:

Airlines arranged by number of tweets.

Observations:

Top 20 users by number of tweets.

Observations:

Top 20 user locations based on the number of tweets.

Observations:

Observations:

Year Created Distribution for Tweets about US Airlines.

Observation:

As per the description of the dataset, all the tweets were generated in 2015 and specifically in the month of February.

Tweet distribution by day.

Observations:

Hourly Tweet distribution.

Observations:

Data Pre-processing (with user-defined functions).

Data Pre-processing steps:
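The exact cleaning code is not reproduced here, but a minimal sketch of a typical tweet-cleaning step (lowercasing, stripping URLs, @mentions, and punctuation; the notebook's actual steps may differ) looks like this:

```python
import re

def clean_tweet(text: str) -> str:
    """Basic tweet cleaning: lowercase, drop URLs and @mentions,
    keep letters only, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"http\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)      # drop @mentions
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    return " ".join(text.split())          # collapse whitespace

print(clean_tweet("@united Flight delayed AGAIN!! http://t.co/abc #fail"))
# → "flight delayed again fail"
```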

Observations:

Creating a Word Cloud for all the Tweets.

Observations:

Words like "flight", "thank", "help", "customer service", "bag", "time", "need", "make", and "fly" are the most frequently used in the overall dataset.
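Word-cloud sizes reflect simple token frequencies; as an illustration (with made-up example tweets), the same ranking can be obtained with a plain counter:

```python
from collections import Counter

# Toy stand-ins for cleaned tweets
tweets = ["thank you for the flight", "flight delayed need help",
          "customer service was great", "bag missing on my flight"]

# Count every token across all tweets; a word cloud sizes each word
# by exactly this kind of frequency.
counts = Counter(word for t in tweets for word in t.split())
print(counts.most_common(3))  # "flight" appears 3 times
```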

Word Cloud for Negative Tweets.

Observations:

Amongst the negative tweets, the most frequent words are "flight", "time", "help", "bag", "customer service", "go", "cancel", "flightled", "delay" etc.

Word Cloud for Positive Tweets.

Observations:

The positive tweets frequently feature words such as "flight", "great", "thank", "love", "awesome", "appreciate", "customer service", "fly", "amaze", etc.

Word Cloud for Neutral Tweets.

Observations:

The neutral tweets have frequent words such as "flight", "thank", "fly", "need", "please", "go", "ticket" etc.

Data Pre-processing (with Keras).

Dividing the Keras dataset into train and test.

Text cleaning.

Converting Text to Numbers:


Statistical approaches such as machine learning and deep learning work with numbers, but the data we have is text. We therefore need to convert the textual data into numeric form. Several approaches exist for converting text to numbers, such as bag of words, TF-IDF, and word2vec.


To convert text to numbers, we can use the “Tokenizer” class from the “keras.preprocessing.text” module. The “Tokenizer” constructor takes a “num_words” parameter, which caps the vocabulary at the most frequently occurring words. This is helpful because words that occur fewer times than a certain threshold contribute little to classification. Next, we call the “fit_on_texts()” method to train the tokenizer. Once trained, the tokenizer can convert text to numbers with the “texts_to_matrix()” method, whose “mode” parameter specifies the conversion scheme. We used the TF-IDF scheme owing to its simplicity and efficiency. The following script converts text to numbers.
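As a small sketch of the API described above (assuming the tensorflow.keras import path and a toy corpus in place of the real tweets):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

docs = ["the flight was late",
        "great flight thank you",
        "late bag late crew"]

tokenizer = Tokenizer(num_words=10)  # keep only the 10 most frequent words
tokenizer.fit_on_texts(docs)         # learn the vocabulary from the corpus
X = tokenizer.texts_to_matrix(docs, mode="tfidf")  # one row per document
print(X.shape)  # (3, 10)
```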

Training the Neural Network (with Keras).

Evaluating the algorithm.

Observations:
Model performance is not strong on the validation and test sets, and there is overfitting on the training set. To further improve accuracy, we could try a different number of layers, dropout, epochs, or activation functions.

Harnessing the TfidfVectorizer to prepare the dataset for the Naive Bayes Classifier model.

Using the Naive Bayes Classifier to generate an accuracy score.

The score of 76.81% is not great, but it is in line with the test accuracy of the immediately preceding neural network trained on the Keras-generated dataset.
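A minimal sketch of this approach with scikit-learn, using toy stand-in tweets and labels (the notebook uses the full dataset and a proper train/test split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the tweet text and sentiment labels
train_texts = ["worst flight ever", "love this airline", "flight was fine",
               "delayed again", "great service thank you", "lost my bag"]
train_labels = ["negative", "positive", "neutral",
                "negative", "positive", "negative"]

# TF-IDF features feeding a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
pred = model.predict(["thank you for the great flight"])
print(pred)
```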

Data Pre-processing (with Tensorflow).

Observation:
The longest tweet has 30 words.

Word Embedding.

In NLP, textual data must be represented in a way computers can work with. We will focus on word embeddings, a representation of text in which similar words have similar representations. One word-embedding model is word2vec, which takes a large corpus of text and outputs a vector space in which each unique word has its own corresponding vector. In this space, words with similar meanings are located close to one another.


Another popular model is Global Vectors for Word Representation (GloVe), an extension of word2vec. It generally allows for better word embeddings by creating a word-context matrix: essentially, it measures how likely certain words are to be seen in the context of others. For example, the word “chip” is likely to be seen in the context of “potato” but not of “cloud”. Its developers created the embeddings using English text obtained from Wikipedia and Common Crawl data.


I will use a pre-trained word embedding, because I believe GloVe generalizes well with the dataset. The embedding space created by GloVe likely contains all the words we will encounter in our tweets, so we can use these vector representations instead of creating our own from a much more limited vocabulary set.
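To make the idea concrete, here is a sketch of building an embedding matrix from pre-trained vectors, using a tiny in-memory dictionary as a stand-in for the parsed GloVe file (e.g. glove.6B.100d.txt, which the real notebook would load from disk):

```python
import numpy as np

# Tiny stand-in for the parsed GloVe file, which maps each word to a
# dense vector (50/100/200/300 dimensions in the real downloads).
glove = {"flight": np.array([0.1, 0.2]),
         "late":   np.array([0.3, 0.1]),
         "thanks": np.array([0.0, 0.4])}
dim = 2

# word_index would normally come from a fitted tokenizer; "zzzz" stands
# in for a word missing from GloVe.
word_index = {"flight": 1, "late": 2, "thanks": 3, "zzzz": 4}

# Row i holds the vector for the word with index i; row 0 is reserved
# for padding, and out-of-vocabulary words stay at zero.
embedding_matrix = np.zeros((len(word_index) + 1, dim))
for word, i in word_index.items():
    if word in glove:
        embedding_matrix[i] = glove[word]

print(embedding_matrix.shape)  # (5, 2)
```

This matrix is what gets passed as the initial weights of an embedding layer, typically with training disabled so the pre-trained vectors are kept frozen.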


Building the Neural Network (with Tensorflow).

Model 1: Simple LSTM model with regularization and increased dimensionality.

Long Short-Term Memory (LSTM):

Simple Recurrent Neural Networks (RNNs) suffer from the vanishing gradient problem, which occurs when information from earlier layers disappears as the network becomes deeper. The LSTM was created to avoid this problem by allowing the neural network to carry information across multiple time steps. This means it can save important information for later use, preventing gradients from vanishing in the process. Additionally, an LSTM cell can determine what information to remove. It can therefore learn to recognize an important input and store it for the future while discarding unnecessary information.
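The gating described above can be sketched as a single LSTM time step in NumPy (an illustrative toy, not the Keras implementation the models use):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: gates decide what to forget, what to write,
    and what to expose, so the cell state c can carry information
    across many steps."""
    z = W @ x + U @ h_prev + b                    # all four gate pre-activations
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget / input / output gates
    g = np.tanh(g)                                # candidate values
    c = f * c_prev + i * g                        # update the cell state
    h = o * np.tanh(c)                            # new hidden state
    return h, c

rng = np.random.default_rng(0)
n, d = 3, 4                                       # hidden size, input size
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(5, d)):                 # run 5 time steps
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (3,)
```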

Observation:

Slight overfitting in training and 78% accuracy in testing.

Observation:

Observations:
We see in the confusion matrices above that Model 1 did an excellent job predicting a negative label when the tweet was negative but struggled more with predicting positive and neutral labels. This may be because our training set was largely composed of negative tweets, so the model learned to assign a higher probability to the negative label from this class imbalance.

Model 2: LSTM with regularization and reduced dimensionality.

Observation:

Observations:
We see in the confusion matrices above that Model 2 did an excellent job predicting a negative label when the tweet was negative but did not do as well with predicting positive and neutral labels. This may be because our training set was largely composed of negative tweets, so the model learned to assign a higher probability to the negative label from this class imbalance. Model 2 performed better than Model 1 with the positive and neutral labels but approximately as well as Model 1 with the negative labels.

Model 3: LSTM Layer Stacking.

Observation:

Observations:

We see in the confusion matrices above that Model 3 did an excellent job predicting a negative label when the tweet was negative but did not do as well with predicting positive and neutral labels. This may be because our training set was largely composed of negative tweets, so the model learned to assign a higher probability to the negative label from this class imbalance. Model 3 performed better than Model 2 with the positive and neutral labels but approximately as well as Model 2 with the negative labels.

Model 4: GRU Layer Stacking.

Observation:

Observations:

We see in the confusion matrices above that Model 4 did an excellent job predicting a negative label when the tweet was negative but did not do as well with predicting positive and neutral labels. This may be because our training set was largely composed of negative tweets, so the model learned to assign a higher probability to the negative label from this class imbalance. Model 4 performed better than Model 3 with the neutral labels, a little worse with the positive labels, and approximately as well as Model 3 with the negative labels.

Model 5: Reduced GRU with More Regularization.

Observation:

Observations:

We see in the confusion matrices above that Model 5 did an excellent job predicting a negative label when the tweet was negative but did not do as well with predicting positive and neutral labels. This may be because our training set was largely composed of negative tweets, so the model learned to assign a higher probability to the negative label from this class imbalance.

Model 6: Bidirectional RNN.

Recurrent Neural Network (RNN):

Unlike feedforward networks, which process each input individually and independently, an RNN creates loops between the nodes in the network. This makes it particularly good for sequential data such as text. It can process sequences and maintain a state containing information about what the network has seen so far. This is why RNNs are useful for natural language processing: sentences are decoded word by word while keeping a memory of the words that came before, giving better context for understanding. An RNN allows information from a previous output to be fed as input into the current state. Simply put, we can use previous information to help make a current decision.


Bidirectional RNNs:

In general, RNNs are order- or time-dependent: they process the time steps in a sequential, unidirectional order. A bidirectional RNN, on the other hand, processes a sequence in both directions, which means it may pick up patterns that would be missed by a unidirectional model. This type of model can therefore improve performance on problems where the data has a chronological order.
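As an illustration, a simple NumPy sketch of the bidirectional idea: run one RNN left-to-right, another right-to-left, and concatenate their final states (the weights are shared here only for brevity; real bidirectional layers use two separate sets):

```python
import numpy as np

def rnn_last_state(xs, W, U, b):
    """Final hidden state of a simple tanh RNN over the sequence xs."""
    h = np.zeros(U.shape[0])
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
    return h

rng = np.random.default_rng(1)
n, d, T = 3, 4, 6                            # hidden size, input size, steps
W = rng.normal(size=(n, d))
U = rng.normal(size=(n, n))
b = np.zeros(n)
seq = rng.normal(size=(T, d))

# A bidirectional layer reads the sequence in both directions and
# concatenates the two representations.
h_fwd = rnn_last_state(seq, W, U, b)
h_bwd = rnn_last_state(seq[::-1], W, U, b)
h_bi = np.concatenate([h_fwd, h_bwd])        # shape (2 * n,)
print(h_bi.shape)  # (6,)
```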

Observation:

Observations:

We see in the confusion matrices above that Model 6 did an excellent job predicting a negative label when the tweet was negative but did not do as well with predicting positive and neutral labels. This may be because our training set was largely composed of negative tweets, so the model learned to assign a higher probability to the negative label from this class imbalance.

Summary.







References.

https://theweek.com/10things/536518/10-things-need-know-today-february22-2015


https://github.com/PacktPublishing/Python-Data-Analysis-Third-Edition/blob/master/Chapter12/.ipynb_checkpoints/Ch-12-checkpoint.ipynb


https://nlp.stanford.edu/projects/glove/


https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html


https://realpython.com/python-keras-text-classification/


https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/


https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html


Extra:

Observation:
The word "flight" is no longer present in the dataset.

Feature Generation using CountVectorizer.

Split train and test.

Classification Model Building using Logistic Regression.

Evaluate the Classification Model.
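A minimal sketch of these steps with scikit-learn, using toy stand-in tweets and labels (the notebook applies the same pipeline to the full dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-ins for the tweet text and sentiment labels
texts = ["late and rude", "great crew", "ok trip", "lost luggage again",
         "wonderful service", "cancelled on me", "thanks for the upgrade",
         "terrible delay"]
labels = ["negative", "positive", "neutral", "negative",
          "positive", "negative", "positive", "negative"]

# Feature generation with CountVectorizer (bag-of-words counts)
X = CountVectorizer().fit_transform(texts)

# Split train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

# Classification model building and evaluation
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```

Swapping CountVectorizer for TfidfVectorizer in the same pipeline gives the TF-IDF variant of the experiment.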

Classification using TF-IDF.

Observation:

Removing the word "flight" did not improve the accuracy scores of either the Logistic Regression (CountVectorizer) model or the TF-IDF model; accuracy remains approximately 77%.